Abstract

We present symmetry tests for bifurcating autoregressive (BAR) processes when some data are missing. BAR processes typically model cell division data. Each cell can be of one of two types, odd or even. The goal of this paper is to study the possible asymmetry between odd and even cells in a single observed lineage. We first derive asymmetry tests for the lineage itself, modeled by a two-type Galton-Watson process, and then derive tests for the observed BAR process. We present applications on both simulated and real data.

1. Introduction



Bifurcating autoregressive (BAR) processes were first introduced by Cowan and Staudte (1986). They generalize autoregressive processes when data are structured as a binary tree. Typically, they are involved in statistical studies of cell lineages, since each cell in one generation gives birth to two offspring in the next one. These two offspring sometimes need to be distinguished according to some biological property, leading to the notion of type. Cell lineage data consist of observations of some quantitative characteristic of the cells (e.g. their growth rate) over several generations descended from an initial cell. More precisely, the initial cell is labelled 1, and the two offspring of cell $k$ are labelled $2k$ and $2k+1$, where $2k$ stands for one type, thus called even, and $2k+1$ for the other type, thus called odd. If $X_k$ denotes the quantitative characteristic of cell $k$, then the first-order asymmetric BAR process is given by

$$
\begin{cases}
X_{2k} = a + bX_k + \varepsilon_{2k},\\
X_{2k+1} = c + dX_k + \varepsilon_{2k+1},
\end{cases}
\qquad (1)
$$

for all $k \geq 1$. The noise sequence $(\varepsilon_{2k}, \varepsilon_{2k+1})$ represents environmental effects, while $a, b, c, d$ are unknown real parameters related to the inherited effects. They are allowed to be different for the odd and even sisters.
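As an illustration of recursion (1), the following minimal sketch simulates a fully observed BAR tree. It assumes independent Gaussian noise with a common standard deviation `sigma` (so the two sisters' noises are uncorrelated, which is only a special case of the model); the function name `simulate_bar` and its arguments are hypothetical.

```python
import random

def simulate_bar(n_gen, a, b, c, d, sigma=1.0, x1=0.0, seed=0):
    """Simulate X_k for every cell k = 1, ..., 2^(n_gen+1) - 1 (generations 0..n_gen).

    Assumption: i.i.d. Gaussian noise N(0, sigma^2), independent between sisters.
    """
    rng = random.Random(seed)
    x = {1: x1}                                  # initial cell, labelled 1
    for k in range(1, 2 ** n_gen):               # mothers: generations 0..n_gen-1
        x[2 * k] = a + b * x[k] + rng.gauss(0.0, sigma)        # even daughter
        x[2 * k + 1] = c + d * x[k] + rng.gauss(0.0, sigma)    # odd daughter
    return x

# Three generations below the root, with an asymmetric slope (b = 0.5, d = 0.4).
x = simulate_bar(n_gen=3, a=0.5, b=0.5, c=0.5, d=0.4)
print(len(x), "cells simulated")
```

Storing the tree in a dictionary keyed by the cell label keeps the mother of cell `k` available as `k // 2`, which matches the labelling used in the text.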



Various estimators of the unknown parameters $a, b, c, d$ are studied in the literature, see (Bercu et al. 2009; Delmas and Marsalle 2010; Guyon 2007). This paper derives further properties of the estimators given in (de Saporta et al. 2011), where the genealogy is modeled by a two-type Galton-Watson (GW) process, allowing the reproduction laws to depend on both the mother's and daughter's types. Indeed, the aim of this paper is to propose asymmetry tests for both the two-type GW process defining the genealogy of the cells and the BAR process with missing data. More precisely, we first study the difference of the means of the reproduction laws for even and odd mother cells in the GW process. Then we investigate the difference between the parameters $(a, b)$ and $(c, d)$ and the difference between the fixed points $a/(1-b)$ and $c/(1-d)$. We propose Wald-type tests based on the asymptotic normality of the various estimators. A detailed study on simulated data, as well as a new investigation of the Escherichia coli data of Stewart et al. (2005), are provided.



The paper is organized as follows. In Section 2 we introduce some notation that will be used throughout the paper. In Section 3 we derive Wald's tests for the two-type GW process describing the genealogy of the cells. In Section 4 we derive Wald's tests for the BAR process, to test asymmetry of its parameters. In Section 5 we apply our tests to simulated data. Finally, in Section 6 we apply our tests to the data of (Stewart et al. 2005).



2. Notation 

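For all $n \geq 0$, denote the $n$-th generation by $\mathbb{G}_n = \{k \in \mathbb{N} \mid 2^n \leq k \leq 2^{n+1} - 1\}$. Denote by $\mathbb{T}_n = \bigcup_{\ell=0}^{n} \mathbb{G}_\ell$ the sub-tree of all cells up to the $n$-th generation. The cardinality $|\mathbb{G}_n|$ of $\mathbb{G}_n$ is $2^n$, while that of $\mathbb{T}_n$ is $|\mathbb{T}_n| = 2^{n+1} - 1$. We need to distinguish the cells in $\mathbb{G}_n$ and $\mathbb{T}_n$ according to their type. The type even will be labelled type 0 and the type odd will be labelled type 1. We set $\mathbb{G}_n^0 = \mathbb{G}_n \cap (2\mathbb{N})$, $\mathbb{G}_n^1 = \mathbb{G}_n \cap (2\mathbb{N}+1)$, $\mathbb{T}_n^0 = \mathbb{T}_n \cap (2\mathbb{N})$, $\mathbb{T}_n^1 = \mathbb{T}_n \cap (2\mathbb{N}+1)$. We encode the presence or absence of the cells by the process $(\delta_k)$: if cell $k$ is observed, then $\delta_k = 1$; if cell $k$ is missing, $\delta_k = 0$. We define the sets of observed cells as $\mathbb{G}_n^* = \{k \in \mathbb{G}_n : \delta_k = 1\}$ and $\mathbb{T}_n^* = \{k \in \mathbb{T}_n : \delta_k = 1\}$. Finally, let $\mathcal{E}$ be the event corresponding to the cases when there is no cell left to observe in the current generation: $\mathcal{E} = \bigcup_{n \geq 1} \{|\mathbb{G}_n^*| = 0\}$, and $\overline{\mathcal{E}}$ the complementary set of $\mathcal{E}$. For $n \geq 1$, we define the number of observed cells in the $n$-th generation, distinguishing according to their type: $Z_n^0 = |\mathbb{G}_n^* \cap 2\mathbb{N}|$, $Z_n^1 = |\mathbb{G}_n^* \cap (2\mathbb{N}+1)|$, and we set, for all $n \geq 1$, $Z_n = (Z_n^0, Z_n^1)$. Note that for $i \in \{0,1\}$ and $n \geq 1$ one has $Z_n^i = \sum_{k \in \mathbb{G}_{n-1}} \delta_{2k+i}$.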
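To make the notation concrete, the sketch below computes the counts $Z_n = (Z_n^0, Z_n^1)$ from an observation process stored as a dictionary mapping each cell label to 0 or 1. The helper names `generation` and `observed_counts` and the toy tree are hypothetical and only serve as an illustration.

```python
def generation(n):
    """Labels of the n-th generation G_n = {2^n, ..., 2^(n+1) - 1}."""
    return range(2 ** n, 2 ** (n + 1))

def observed_counts(delta, n):
    """Return Z_n = (Z_n^0, Z_n^1): observed even and odd cells in generation n."""
    z0 = sum(delta.get(k, 0) for k in generation(n) if k % 2 == 0)
    z1 = sum(delta.get(k, 0) for k in generation(n) if k % 2 == 1)
    return z0, z1

# Toy observation process on T_3: cell 5 is missing, hence so are its daughters 10, 11.
delta = {k: 1 for k in range(1, 16)}
for k in (5, 10, 11):
    delta[k] = 0

print([observed_counts(delta, n) for n in (1, 2, 3)])  # [(1, 1), (2, 1), (3, 3)]
```

The same counts can be obtained from the mothers' side, since $Z_n^i$ is the sum of $\delta_{2k+i}$ over $k$ in generation $n-1$.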



3. Asymmetry in the lineage 

We now describe the mechanism generating the observation process $(\delta_k)$. The resulting process $(Z_n)$ is a two-type GW process. We recall some assumptions similar to those of (de Saporta et al. 2011) and mostly taken from (Harris 1963).



3.1. Model and assumptions 

We define the cells genealogy by a two-type GW process $(Z_n)$. All cells reproduce independently and with a reproduction law depending only on their type. For a mother cell of type $i \in \{0,1\}$, we denote by $p^{(i)}(j_0, j_1)$ the probability that it has $j_0$ daughter of type 0 and $j_1$ daughter of type 1. For the cell division process we are interested in, one clearly has $p^{(i)}(0,0) + p^{(i)}(1,0) + p^{(i)}(0,1) + p^{(i)}(1,1) = 1$. The reproduction laws have moments of all order, and we can thus define the descendants matrix $P$

$$
P = \begin{pmatrix} p_{00} & p_{01} \\ p_{10} & p_{11} \end{pmatrix},
$$

where $p_{i0} = p^{(i)}(1,0) + p^{(i)}(1,1)$ and $p_{i1} = p^{(i)}(0,1) + p^{(i)}(1,1)$, for $i \in \{0,1\}$: $p_{ij}$ is the expected number of descendants of type $j$ of a cell of type $i$. It is well-known that when all the entries of the matrix $P$ are positive, $P$ has a strictly dominant positive eigenvalue, denoted $\pi$, which is also simple. We make the following main assumption.

(AO) All entries of the matrix $P$ are positive and the dominant eigenvalue is greater than one: $\pi > 1$.

In this case, there exist left and right eigenvectors for $\pi$ which are component-wise positive. Let $z = (z^0, z^1)$ be such a left eigenvector satisfying $z^0 + z^1 = 1$. This dominant eigenvalue $\pi$ is related to the extinction of the process: assumption (AO) means that the GW process $(Z_n)$ is super-critical, and ensures that extinction is not almost sure: $\mathbb{P}(\mathcal{E}) < 1$. Besides, on $\overline{\mathcal{E}}$, $|\mathbb{T}_n^*|^{-1} \sum_{\ell=1}^{n} Z_\ell^i$ converges to $z^i$, meaning that $z^i$ is the asymptotic proportion of cells of type $i$ in a given sub-tree.
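Assumption (AO) is easy to check numerically for given reproduction laws. The following sketch builds the descendants matrix and uses the closed-form eigenvalue of a $2 \times 2$ matrix; the function names are hypothetical, and the numerical values are the reproduction laws quoted in the caption of Table 1.

```python
import math

def descendants_matrix(p0, p1):
    """Rows (p_i0, p_i1) from p_i = (p(0,0), p(1,0), p(0,1), p(1,1)), i = 0, 1."""
    return [[p[1] + p[3], p[2] + p[3]] for p in (p0, p1)]

def dominant_eigen(P):
    """Dominant eigenvalue pi and left eigenvector z (z0 + z1 = 1) of a positive 2x2 matrix."""
    tr = P[0][0] + P[1][1]
    det = P[0][0] * P[1][1] - P[0][1] * P[1][0]
    pi = (tr + math.sqrt(tr * tr - 4.0 * det)) / 2.0
    z0, z1 = P[1][0], pi - P[0][0]          # solves z (P - pi I) = 0
    return pi, (z0 / (z0 + z1), z1 / (z0 + z1))

P = descendants_matrix((0.04, 0.08, 0.08, 0.8), (0.15, 0.08, 0.08, 0.69))
pi, z = dominant_eigen(P)
print(pi > 1.0, z)   # (AO) holds when all entries of P are positive and pi > 1
```

For these values the matrix has rows (0.88, 0.88) and (0.77, 0.77), so the dominant eigenvalue is about 1.65 and the process is super-critical.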
3.2. Asymmetry test

We first propose estimators for the parameters of the GW process and study their asymptotic properties. Our context is very specific because the available information given by $(\delta_k)$ is more precise than the one given by $(Z_n)$ usually used in the literature, see e.g. (Guttorp 1991). The empirical estimators of the reproduction probabilities using data up to the $n$-th generation are then, for $i$, $j_0$, $j_1$ in $\{0,1\}$,

$$
\hat{p}_n^{(i)}(j_0, j_1) = \frac{\sum_{k \in \mathbb{T}_{n-2}} \delta_{2k+i}\, \phi_{j_0}(\delta_{2(2k+i)})\, \phi_{j_1}(\delta_{2(2k+i)+1})}{\sum_{k \in \mathbb{T}_{n-2}} \delta_{2k+i}},
$$

where $\phi_0(x) = 1 - x$, $\phi_1(x) = x$, and if the denominators are non zero (the estimate is zero otherwise). Strong consistency is easily obtained on the non-extinction set $\overline{\mathcal{E}}$, e.g. by martingale methods as in (de Saporta et al. 2011).
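The estimator simply counts, among the observed mothers of type $i$, how often each pattern $(j_0, j_1)$ of observed daughters occurs. A minimal sketch, assuming the observation process is stored as a dictionary from cell labels to 0 or 1 (the function name and the toy data are hypothetical):

```python
def estimate_reproduction(delta, n, i):
    """Empirical p^(i)(j0, j1), using mothers 2k+i with k in T_{n-2}."""
    counts = {(j0, j1): 0 for j0 in (0, 1) for j1 in (0, 1)}
    mothers = 0
    for k in range(1, 2 ** (n - 1)):            # k runs over T_{n-2}
        m = 2 * k + i                           # candidate mother of type i
        if delta.get(m, 0) == 1:                # only observed mothers contribute
            mothers += 1
            counts[(delta.get(2 * m, 0), delta.get(2 * m + 1, 0))] += 1
    if mothers == 0:                            # convention: the estimate is zero
        return {key: 0.0 for key in counts}
    return {key: c / mothers for key, c in counts.items()}

# Toy tree up to generation 3: cell 5 and its daughters 10, 11 are missing.
delta = {k: 1 for k in range(1, 16)}
for k in (5, 10, 11):
    delta[k] = 0

p0_hat = estimate_reproduction(delta, 3, 0)     # even mothers 2, 4, 6
p1_hat = estimate_reproduction(delta, 3, 1)     # odd mothers 3, 7 (5 is missing)
print(p0_hat, p1_hat)
```

In this toy tree, the even mother 2 has only its even daughter observed while mothers 4 and 6 have both daughters observed, so the estimate of $p^{(0)}(1,0)$ is 1/3 and that of $p^{(0)}(1,1)$ is 2/3.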



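Lemma 3.1. Under (AO) and for all $i$, $j_0$ and $j_1$ in $\{0,1\}$, one has $\lim_{n \to \infty} \mathbb{1}_{\{|\mathbb{G}_n^*| > 0\}}\, \hat{p}_n^{(i)}(j_0, j_1) = \mathbb{1}_{\overline{\mathcal{E}}}\, p^{(i)}(j_0, j_1)$ a.s.

Set $p^{(i)} = (p^{(i)}(0,0), p^{(i)}(1,0), p^{(i)}(0,1), p^{(i)}(1,1))^t$ the vector of the 4 reproduction probabilities for a mother of type $i$, $p = ((p^{(0)})^t, (p^{(1)})^t)^t$ the vector of all 8 reproduction probabilities and $\hat{p}_n = (\hat{p}_n^{(0)}(0,0), \ldots, \hat{p}_n^{(1)}(1,1))^t$ its estimator. As $\mathbb{P}(\overline{\mathcal{E}}) \neq 0$, we define the conditional probability $\mathbb{P}_{\overline{\mathcal{E}}}$ by $\mathbb{P}_{\overline{\mathcal{E}}}(A) = \mathbb{P}(A \cap \overline{\mathcal{E}}) / \mathbb{P}(\overline{\mathcal{E}})$ for every event $A$.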



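Theorem 3.2. Under assumption (AO), we have the convergence in law $\sqrt{|\mathbb{T}_{n-1}^*|}\,(\hat{p}_n - p) \to \mathcal{N}(0, V)$ on $(\overline{\mathcal{E}}, \mathbb{P}_{\overline{\mathcal{E}}})$, with

$$
V = \begin{pmatrix} V^0 / z^0 & 0 \\ 0 & V^1 / z^1 \end{pmatrix},
$$

and for all $i$ in $\{0,1\}$, $V^i = W^i - p^{(i)} (p^{(i)})^t$, where $W^i$ is a $4 \times 4$ matrix with the entries of $p^{(i)}$ on the diagonal and 0 elsewhere.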
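Proof: For all $n \geq 2$ and $q \geq 1$, set

$$
M_q^{(n)} = \frac{1}{\sqrt{|\mathbb{T}_{n-1}^*|}} \sum_{k=1}^{q}
\begin{pmatrix}
\delta_{2k}\big((1-\delta_{4k})(1-\delta_{4k+1}) - p^{(0)}(0,0)\big) \\
\delta_{2k}\big(\delta_{4k}(1-\delta_{4k+1}) - p^{(0)}(1,0)\big) \\
\delta_{2k}\big((1-\delta_{4k})\delta_{4k+1} - p^{(0)}(0,1)\big) \\
\delta_{2k}\big(\delta_{4k}\delta_{4k+1} - p^{(0)}(1,1)\big) \\
\delta_{2k+1}\big((1-\delta_{4k+2})(1-\delta_{4k+3}) - p^{(1)}(0,0)\big) \\
\delta_{2k+1}\big(\delta_{4k+2}(1-\delta_{4k+3}) - p^{(1)}(1,0)\big) \\
\delta_{2k+1}\big((1-\delta_{4k+2})\delta_{4k+3} - p^{(1)}(0,1)\big) \\
\delta_{2k+1}\big(\delta_{4k+2}\delta_{4k+3} - p^{(1)}(1,1)\big)
\end{pmatrix}.
$$

Let $(\mathcal{G}_q)$ be the filtration of cousin cells: $\mathcal{G}_0 = \sigma\{\delta_1, \delta_2, \delta_3\}$ and for all $q \geq 1$, $\mathcal{G}_q = \mathcal{G}_{q-1} \vee \sigma\{\delta_{4q}, \delta_{4q+1}, \delta_{4q+2}, \delta_{4q+3}\}$. For all $n \geq 2$, $(M_q^{(n)})$ is a $(\mathcal{G}_q)$-martingale with finite moments of all order. We apply Theorem 3.II.10 of (Duflo 1997) to this sequence of martingales and with the stopping times $\nu_n = |\mathbb{T}_{n-2}|$. The $\mathbb{P}_{\overline{\mathcal{E}}}$ a.s. limit of the increasing process is

$$
\frac{1}{|\mathbb{T}_{n-1}^*|}
\begin{pmatrix}
\sum_{k \in \mathbb{T}_{n-2}} \delta_{2k}\, V^0 & 0 \\
0 & \sum_{k \in \mathbb{T}_{n-2}} \delta_{2k+1}\, V^1
\end{pmatrix}
\longrightarrow
\Gamma = \begin{pmatrix} z^0 V^0 & 0 \\ 0 & z^1 V^1 \end{pmatrix}.
$$

In addition, the Lindeberg condition holds as the $\delta_k$ have finite moments of all order. Thus, we obtain the convergence in law $M^{(n)}_{|\mathbb{T}_{n-2}|} \to \mathcal{N}(0, \Gamma)$ on $(\overline{\mathcal{E}}, \mathbb{P}_{\overline{\mathcal{E}}})$. On the other hand, $\Lambda_n^{-1}\, |\mathbb{T}_{n-1}^*|\, M^{(n)}_{|\mathbb{T}_{n-2}|} = \sqrt{|\mathbb{T}_{n-1}^*|}\,(\hat{p}_n - p)$, with

$$
\Lambda_n = \begin{pmatrix} \sum_{\ell=1}^{n-1} Z_\ell^0\, I_4 & 0 \\ 0 & \sum_{\ell=1}^{n-1} Z_\ell^1\, I_4 \end{pmatrix},
$$

and $I_4$ is the identity matrix of size 4. As $|\mathbb{T}_n^*|^{-1} \sum_{\ell=1}^{n} Z_\ell^i$ converges almost surely to $z^i$ on $(\overline{\mathcal{E}}, \mathbb{P}_{\overline{\mathcal{E}}})$, we have the asymptotic normality as announced, using Slutsky's lemma. □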



Using the asymptotic normality of the $\hat{p}_n^{(i)}(j_0, j_1)$, we can derive Wald's test for the asymmetry of the means of the reproduction laws. Set $m = (p^{(0)}(1,0) + p^{(0)}(0,1) + 2p^{(0)}(1,1)) - (p^{(1)}(1,0) + p^{(1)}(0,1) + 2p^{(1)}(1,1))$ the difference of the means of the types 0 and 1 reproduction laws and $\hat{m}_n = (\hat{p}_n^{(0)}(1,0) + \hat{p}_n^{(0)}(0,1) + 2\hat{p}_n^{(0)}(1,1)) - (\hat{p}_n^{(1)}(1,0) + \hat{p}_n^{(1)}(0,1) + 2\hat{p}_n^{(1)}(1,1))$ its empirical estimator. Set $H_0^{GW}$: $m = 0$ the symmetry hypothesis and $H_1^{GW}$: $m \neq 0$. Let $(Y_n^{GW})^2$ be the test statistic defined by $Y_n^{GW} = |\mathbb{T}_{n-1}^*|^{1/2} (\hat{\Delta}_n^{GW})^{-1/2}\, \hat{m}_n$, where $\hat{\Delta}_n^{GW} = \mathrm{dgw}^t\, \hat{V}_n\, \mathrm{dgw}$, $\mathrm{dgw} = (0, 1, 1, 2, 0, -1, -1, -2)^t$, and $\hat{V}_n$ is the empirical version of $V$, where $z^i$ is replaced by $|\mathbb{T}_n^*|^{-1} \sum_{\ell=1}^{n} Z_\ell^i$ and the $p^{(i)}(j_0, j_1)$ are replaced by $\hat{p}_n^{(i)}(j_0, j_1)$. Thanks to Lemma 3.1, $\hat{V}_n$ converges a.s. to $V$ and the test statistic has the following asymptotic properties.



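Theorem 3.3. Under assumption (AO) and the null hypothesis $H_0^{GW}$, one has the convergence in law $(Y_n^{GW})^2 \to \chi^2(1)$ on $(\overline{\mathcal{E}}, \mathbb{P}_{\overline{\mathcal{E}}})$; and under the alternative hypothesis $H_1^{GW}$, one has $\lim_{n \to \infty} (Y_n^{GW})^2 = +\infty$ a.s. on $(\overline{\mathcal{E}}, \mathbb{P}_{\overline{\mathcal{E}}})$.

Proof: Let $g$ be the function defined from $\mathbb{R}^8$ onto $\mathbb{R}$ by $g(x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8) = (x_2 + x_3 + 2x_4) - (x_6 + x_7 + 2x_8)$, so that $\hat{m}_n - m = g(\hat{p}_n) - g(p)$, and $\mathrm{dgw}$ is the gradient of $g$. Thus, Theorem 3.2 yields the convergence in law $\sqrt{|\mathbb{T}_{n-1}^*|}\,(g(\hat{p}_n) - g(p)) \to \mathcal{N}(0, \Delta^{GW})$ on $(\overline{\mathcal{E}}, \mathbb{P}_{\overline{\mathcal{E}}})$, with $\Delta^{GW} = \mathrm{dgw}^t\, V\, \mathrm{dgw}$. Under the null hypothesis $H_0^{GW}$, $g(p) = m = 0$, so that $|\mathbb{T}_{n-1}^*|\, (\Delta^{GW})^{-1} g(\hat{p}_n)^2 \to \chi^2(1)$ on $(\overline{\mathcal{E}}, \mathbb{P}_{\overline{\mathcal{E}}})$. One then uses Slutsky's lemma to replace $\Delta^{GW}$ by its estimator $\hat{\Delta}_n^{GW}$ and obtain the first convergence. Under the alternative hypothesis $H_1^{GW}$, since $Y_n^{GW} = |\mathbb{T}_{n-1}^*|^{1/2} (\hat{\Delta}_n^{GW})^{-1/2}\, \hat{m}_n$ and $\hat{m}_n$ converges to $m \neq 0$, $|Y_n^{GW}|$ tends to infinity a.s. on $(\overline{\mathcal{E}}, \mathbb{P}_{\overline{\mathcal{E}}})$. □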
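The test statistic can be assembled directly from the estimated reproduction probabilities. The sketch below is an illustration under stated assumptions: the inputs `p0_hat`, `p1_hat` (4-vectors ordered as $p^{(i)}$), `z_hat` (estimated type proportions) and `n_obs` (standing for $|\mathbb{T}_{n-1}^*|$) are hypothetical names, and the variance estimate is assumed to be non zero.

```python
import math

def multinomial_cov(p):
    """V^i = diag(p) - p p^t for a 4-vector p of reproduction probabilities."""
    return [[(p[r] if r == s else 0.0) - p[r] * p[s] for s in range(4)] for r in range(4)]

def gw_wald_test(p0_hat, p1_hat, z_hat, n_obs):
    """Return ((Y_n^GW)^2, p-value) for H0: equal means of the two reproduction laws."""
    d = (0.0, 1.0, 1.0, 2.0)                    # first half of dgw; second half is -d
    m_hat = sum(w * (u - v) for w, u, v in zip(d, p0_hat, p1_hat))
    delta_hat = 0.0                             # dgw^t V_n dgw, V_n block diagonal
    for p, z in ((p0_hat, z_hat[0]), (p1_hat, z_hat[1])):
        V = multinomial_cov(p)
        delta_hat += sum(d[r] * V[r][s] * d[s] for r in range(4) for s in range(4)) / z
    stat = n_obs * m_hat ** 2 / delta_hat       # assumes delta_hat > 0
    p_value = math.erfc(math.sqrt(stat / 2.0))  # survival function of chi-square(1)
    return stat, p_value

stat, p_value = gw_wald_test((0.04, 0.08, 0.08, 0.8), (0.15, 0.08, 0.08, 0.69),
                             (0.5, 0.5), 1000)
print(stat, p_value)
```

The sign change between the two halves of $\mathrm{dgw}$ does not affect the quadratic form, which is why the same weights `d` are used for both blocks of the variance.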



4. Asymmetry in cells characteristic 

We now turn to the study of asymmetry of the BAR model with missing data. We first recall some asymptotic results on the estimation of the BAR parameters proved in (de Saporta et al. 2011).



4.1. Model and assumptions 

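We consider the first-order asymmetric BAR process given by Eq. (1). We assume that $\mathbb{E}[X_1^8] < \infty$. Moreover the parameters satisfy $0 < \max(|b|, |d|) < 1$. Denote by $\mathbb{F} = (\mathcal{F}_n)$ the natural filtration associated with the BAR process: $\mathcal{F}_n = \sigma\{X_k, k \in \mathbb{T}_n\}$. In all the sequel, we shall make use of the following moment and independence hypotheses.

(AN.1) For all $n \geq 0$ and for all $k \in \mathbb{G}_{n+1}$, $\varepsilon_k$ belongs to $L^8$ with $\sup_{n \geq 0} \sup_{k \in \mathbb{G}_{n+1}} \mathbb{E}[\varepsilon_k^8 \mid \mathcal{F}_n] < \infty$ a.s. Moreover, there exist $\sigma^2 > 0$ and $\rho$ such that $\forall n \geq 0$ and $k \in \mathbb{G}_{n+1}$, $\mathbb{E}[\varepsilon_k \mid \mathcal{F}_n] = 0$, $\mathbb{E}[\varepsilon_k^2 \mid \mathcal{F}_n] = \sigma^2$ a.s.; $\forall n \geq 0$, $\forall k \neq l \in \mathbb{G}_{n+1}$ with $[k/2] = [l/2]$, $\mathbb{E}[\varepsilon_k \varepsilon_l \mid \mathcal{F}_n] = \rho$ a.s., where $[x]$ denotes the largest integer less than or equal to $x$.

(AN.2) For all $n \geq 0$ the random vectors $\{(\varepsilon_{2k}, \varepsilon_{2k+1}), k \in \mathbb{G}_n\}$ are conditionally independent given $\mathcal{F}_n$.

(AI) The sequence $(\delta_k)$ is independent from the sequences $(X_k)$ and $(\varepsilon_k)$.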

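The least-squares estimator of $\theta = (a, b, c, d)^t$ is given for all $n \geq 1$ by $\hat{\theta}_n = (\hat{a}_n, \hat{b}_n, \hat{c}_n, \hat{d}_n)^t$ with

$$
\hat{\theta}_n = \Sigma_{n-1}^{-1} \sum_{k \in \mathbb{T}_{n-1}}
\begin{pmatrix}
\delta_{2k} X_{2k} \\
\delta_{2k} X_k X_{2k} \\
\delta_{2k+1} X_{2k+1} \\
\delta_{2k+1} X_k X_{2k+1}
\end{pmatrix},
\qquad
\Sigma_n = \begin{pmatrix} S_n^0 & 0 \\ 0 & S_n^1 \end{pmatrix},
$$

$$
S_n^i = \sum_{k \in \mathbb{T}_n} \delta_{2k+i} \begin{pmatrix} 1 & X_k \\ X_k & X_k^2 \end{pmatrix},
\qquad
S_n^{0,1} = \sum_{k \in \mathbb{T}_n} \delta_{2k}\delta_{2k+1} \begin{pmatrix} 1 & X_k \\ X_k & X_k^2 \end{pmatrix}.
$$

Denote also $L^0$, $L^1$, $L^{0,1}$ the a.s. limits: $\lim_{n \to \infty} \mathbb{1}_{\{|\mathbb{G}_n^*| > 0\}} S_n^i / |\mathbb{T}_n^*| = \mathbb{1}_{\overline{\mathcal{E}}} L^i$, $\lim_{n \to \infty} \mathbb{1}_{\{|\mathbb{G}_n^*| > 0\}} S_n^{0,1} / |\mathbb{T}_n^*| = \mathbb{1}_{\overline{\mathcal{E}}} L^{0,1}$ (see Proposition 4.2 of (de Saporta et al. 2011)). We now recall Theorems 3.2 and 3.4 of (de Saporta et al. 2011).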



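Theorem 4.1. Under (AN.1-2), (AO) and (AI), the estimator $\hat{\theta}_n$ is strongly consistent: $\lim_{n \to \infty} \mathbb{1}_{\{|\mathbb{G}_n^*| > 0\}}\, \hat{\theta}_n = \theta\, \mathbb{1}_{\overline{\mathcal{E}}}$ a.s. In addition, we have the asymptotic normality $\sqrt{|\mathbb{T}_{n-1}^*|}\,(\hat{\theta}_n - \theta) \to \mathcal{N}(0, \Sigma^{-1} \Gamma \Sigma^{-1})$ on $(\overline{\mathcal{E}}, \mathbb{P}_{\overline{\mathcal{E}}})$, where

$$
\Sigma = \begin{pmatrix} L^0 & 0 \\ 0 & L^1 \end{pmatrix}
\quad \text{and} \quad
\Gamma = \begin{pmatrix} \sigma^2 L^0 & \rho L^{0,1} \\ \rho L^{0,1} & \sigma^2 L^1 \end{pmatrix}.
$$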
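Because $\Sigma_n$ is block diagonal, the least-squares estimator amounts to two separate simple regressions of the observed daughters on their mothers, one per type. A minimal sketch, assuming every observed daughter has an observed mother and that the observed mothers' values are not all equal (so each $2 \times 2$ block is invertible); the function name and the toy data are hypothetical.

```python
def least_squares_bar(x, delta, n):
    """Least-squares estimate (a, b, c, d) from pairs (X_k, X_{2k+i}), k in T_{n-1}."""
    theta = []
    for i in (0, 1):                               # i = 0: even daughters, i = 1: odd
        s1 = sx = sxx = sy = sxy = 0.0
        for k in range(1, 2 ** n):                 # k runs over T_{n-1}
            if delta.get(2 * k + i, 0) == 1:       # daughter 2k+i is observed
                s1 += 1.0
                sx += x[k]
                sxx += x[k] ** 2
                sy += x[2 * k + i]
                sxy += x[k] * x[2 * k + i]
        det = s1 * sxx - sx * sx                   # determinant of the block S^i
        theta.append((sxx * sy - sx * sxy) / det)  # intercept (a or c)
        theta.append((s1 * sxy - sx * sy) / det)   # slope (b or d)
    return tuple(theta)

# Noise-free toy tree with a = 1, b = 0.5, c = 2, d = 0.25, fully observed.
x = {1: 2.0}
for k in range(1, 8):
    x[2 * k] = 1.0 + 0.5 * x[k]
    x[2 * k + 1] = 2.0 + 0.25 * x[k]
delta = {k: 1 for k in x}
theta_hat = least_squares_bar(x, delta, n=3)
print(theta_hat)
```

On noise-free data the fit recovers the parameters up to rounding, which gives a quick sanity check of the normal equations.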



4.2. Asymmetry tests 



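Using Theorem 4.1 we now propose two different asymmetry tests. The first one compares the couples $(a, b)$ and $(c, d)$. Set $H_0^c$: $(a, b) = (c, d)$ the symmetry hypothesis and $H_1^c$: $(a, b) \neq (c, d)$. Let $(Y_n^c)^t Y_n^c$ be the test statistic defined by $Y_n^c = |\mathbb{T}_{n-1}^*|^{1/2} (\hat{\Delta}_n^c)^{-1/2} (\hat{a}_n - \hat{c}_n, \hat{b}_n - \hat{d}_n)^t$, with $\hat{\Delta}_n^c = \mathrm{dgc}^t\, |\mathbb{T}_{n-1}^*| \Sigma_{n-1}^{-1}\, \hat{\Gamma}_{n-1}\, |\mathbb{T}_{n-1}^*| \Sigma_{n-1}^{-1}\, \mathrm{dgc}$,

$$
\mathrm{dgc} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ -1 & 0 \\ 0 & -1 \end{pmatrix},
\qquad
\hat{\Gamma}_n = \frac{1}{|\mathbb{T}_n^*|} \begin{pmatrix} \hat{\sigma}_n^2 S_n^0 & \hat{\rho}_n S_n^{0,1} \\ \hat{\rho}_n S_n^{0,1} & \hat{\sigma}_n^2 S_n^1 \end{pmatrix},
$$

$$
\hat{\sigma}_n^2 = \frac{1}{|\mathbb{T}_n^*|} \sum_{k \in \mathbb{T}_{n-1}^*} (\hat{\varepsilon}_{2k}^2 + \hat{\varepsilon}_{2k+1}^2),
\qquad
\hat{\rho}_n = \frac{1}{|\mathbb{T}_{n-1}^{*01}|} \sum_{k \in \mathbb{T}_{n-1}} \hat{\varepsilon}_{2k}\, \hat{\varepsilon}_{2k+1},
$$

$\mathbb{T}_n^{*01} = \{k \in \mathbb{T}_n : \delta_{2k}\delta_{2k+1} = 1\}$ and for all $k \in \mathbb{G}_n$, $\hat{\varepsilon}_{2k} = \delta_{2k}(X_{2k} - \hat{a}_n - \hat{b}_n X_k)$, $\hat{\varepsilon}_{2k+1} = \delta_{2k+1}(X_{2k+1} - \hat{c}_n - \hat{d}_n X_k)$.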

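Theorem 4.2. Under assumptions (AN.1-2), (AO), (AI) and the null hypothesis $H_0^c$, one has the convergence in law $(Y_n^c)^t Y_n^c \to \chi^2(2)$ on $(\overline{\mathcal{E}}, \mathbb{P}_{\overline{\mathcal{E}}})$; and under the alternative hypothesis $H_1^c$, one has $\lim_{n \to \infty} \|Y_n^c\|^2 = +\infty$ a.s. on $(\overline{\mathcal{E}}, \mathbb{P}_{\overline{\mathcal{E}}})$.

Proof: We mimic the proof of Theorem 3.3 with $g$ the function defined from $\mathbb{R}^4$ onto $\mathbb{R}^2$ by $g(x_1, x_2, x_3, x_4) = (x_1 - x_3, x_2 - x_4)^t$, so that $\mathrm{dgc}$ is the gradient of $g$. □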
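Once an estimate of the asymptotic covariance of $\sqrt{|\mathbb{T}_{n-1}^*|}(\hat{\theta}_n - \theta)$ is available, the statistic of this first test is a two-dimensional quadratic form. The sketch below is only an illustration: `cov` (a symmetric $4 \times 4$ list of lists standing for $|\mathbb{T}_{n-1}^*| \Sigma_{n-1}^{-1} \hat{\Gamma}_{n-1} |\mathbb{T}_{n-1}^*| \Sigma_{n-1}^{-1}$), `n_obs` (standing for $|\mathbb{T}_{n-1}^*|$) and the function name are hypothetical.

```python
import math

def parameter_test(theta_hat, cov, n_obs):
    """Return (||Y_n^c||^2, p-value) for H0: (a, b) = (c, d).

    cov: symmetric 4x4 estimate of the asymptotic covariance of
    sqrt(n_obs) * (theta_hat - theta), assumed to yield an invertible 2x2 block.
    """
    a, b, c, d = theta_hat
    u = (a - c, b - d)

    def q(r, s):
        # Entry (r, s) of dgc^t cov dgc, with dgc = [[1,0],[0,1],[-1,0],[0,-1]].
        return cov[r][s] - cov[r][s + 2] - cov[r + 2][s] + cov[r + 2][s + 2]

    d11, d12, d22 = q(0, 0), q(0, 1), q(1, 1)
    det = d11 * d22 - d12 * d12
    # Quadratic form u^t Delta^{-1} u, using the closed-form 2x2 inverse.
    quad = (d22 * u[0] ** 2 - 2.0 * d12 * u[0] * u[1] + d11 * u[1] ** 2) / det
    stat = n_obs * quad
    return stat, math.exp(-stat / 2.0)      # survival function of chi-square(2)

identity = [[1.0 if r == s else 0.0 for s in range(4)] for r in range(4)]
stat_c, p_c = parameter_test((0.5, 0.5, 0.5, 0.4), identity, 100)
print(stat_c, p_c)
```

The identity covariance used in the usage lines is purely illustrative; in practice the covariance would be built from the residual-based estimates of $\sigma^2$ and $\rho$.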

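Our second test compares the fixed points $a/(1-b)$ and $c/(1-d)$, which are the asymptotic means of $X_{2k}$ and $X_{2k+1}$ respectively. Set $H_0^f$: $a/(1-b) = c/(1-d)$ the symmetry hypothesis and $H_1^f$: $a/(1-b) \neq c/(1-d)$. Let $(Y_n^f)^2$ be the test statistic defined by $Y_n^f = |\mathbb{T}_{n-1}^*|^{1/2} (\hat{\Delta}_n^f)^{-1/2} \big(\hat{a}_n/(1-\hat{b}_n) - \hat{c}_n/(1-\hat{d}_n)\big)$, where $\hat{\Delta}_n^f = \mathrm{dgf}^t\, |\mathbb{T}_{n-1}^*| \Sigma_{n-1}^{-1}\, \hat{\Gamma}_{n-1}\, |\mathbb{T}_{n-1}^*| \Sigma_{n-1}^{-1}\, \mathrm{dgf}$, $\mathrm{dgf} = \big(1/(1-b),\ a/(1-b)^2,\ -1/(1-d),\ -c/(1-d)^2\big)^t$, the parameters being replaced by their estimates when computing $\hat{\Delta}_n^f$. This test statistic has the following asymptotic properties.

Theorem 4.3. Under assumptions (AN.1-2), (AO), (AI) and the null hypothesis $H_0^f$, one has the convergence in law $(Y_n^f)^2 \to \chi^2(1)$ on $(\overline{\mathcal{E}}, \mathbb{P}_{\overline{\mathcal{E}}})$; and under the alternative hypothesis $H_1^f$, one has $\lim_{n \to \infty} (Y_n^f)^2 = +\infty$ a.s. on $(\overline{\mathcal{E}}, \mathbb{P}_{\overline{\mathcal{E}}})$.

Proof: We mimic the proof of Theorem 3.3 with $g$ the function defined from $\mathbb{R}^4$ onto $\mathbb{R}$ by $g(x_1, x_2, x_3, x_4) = x_1/(1 - x_2) - x_3/(1 - x_4)$, so that $\mathrm{dgf}$ is the gradient of $g$. □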
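The fixed-point test is a direct application of the delta method to $g(x_1, x_2, x_3, x_4) = x_1/(1-x_2) - x_3/(1-x_4)$. A minimal sketch, assuming an estimate `cov` of the asymptotic covariance of $\sqrt{|\mathbb{T}_{n-1}^*|}(\hat{\theta}_n - \theta)$ is already available and that the gradient is evaluated at the estimates; `cov`, `n_obs` and the function name are hypothetical.

```python
import math

def fixed_point_test(theta_hat, cov, n_obs):
    """Return ((Y_n^f)^2, p-value) for H0: a/(1-b) = c/(1-d).

    cov: 4x4 estimate of the asymptotic covariance of sqrt(n_obs) * (theta_hat - theta).
    The gradient dgf is evaluated at theta_hat (requires b != 1 and d != 1).
    """
    a, b, c, d = theta_hat
    grad = (1.0 / (1.0 - b), a / (1.0 - b) ** 2,
            -1.0 / (1.0 - d), -c / (1.0 - d) ** 2)
    diff = a / (1.0 - b) - c / (1.0 - d)        # estimated difference of fixed points
    var = sum(grad[r] * cov[r][s] * grad[s] for r in range(4) for s in range(4))
    stat = n_obs * diff ** 2 / var
    return stat, math.erfc(math.sqrt(stat / 2.0))   # survival function of chi-square(1)

identity = [[1.0 if r == s else 0.0 for s in range(4)] for r in range(4)]
stat_f, p_f = fixed_point_test((0.5, 0.5, 0.5, 0.4), identity, 100)
print(stat_f, p_f)
```

Since the gradient involves $1/(1-b)^2$ and $1/(1-d)^2$, the variance estimate can be large when the slopes are close to 1, which is one possible reason for a slower convergence of this statistic than of the parameter test.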



5. Application to simulated data 

We now study the behavior of our three tests on simulated data. For each test, we compute, as a function of the generation $n$ and for different thresholds, the proportion of rejections under hypotheses $H_0$ and $H_1$, the latter proportion being an indicator of the power of the test. Proportions are computed on a sample of 1000 replicated trees.

5.1. Asymmetry test for the Galton-Watson process

In Table 1 we see that the observed proportions of p-values under the thresholds (0.05, 0.01, 0.001) are close to the expected proportions of rejection under $H_0^{GW}$, suggesting that the asymptotic law of the statistic $(Y_n^{GW})^2$ is available by generation 8. Under $H_1^{GW}$, the power of the test increases from 27.8% at generation 7 to 93.1% at generation 11 for a type I risk fixed at 0.05.

5.2. Asymmetry tests for the BAR process 

The first asymmetry test compares the parameters $(a, b)$ and $(c, d)$. In Table 2 we see that the observed proportions of p-values under the thresholds (0.05, 0.01, 0.001) are close to the expected proportions of rejection under $H_0^c$, suggesting that the asymptotic law of the statistic $\|Y_n^c\|^2$ is available at generation 8. Under $H_1^c$, the power of the test increases from 37.4% at generation 7 to 95.7% at generation 11 for a type I risk fixed at 0.05.

For the asymmetry test for the fixed points, we see in Table 3 that the observed proportions deviate from the expected ones under $H_0^f$ until the 10th generation, suggesting that the asymptotic law of the statistic is not reached before the 10th generation. We also remark that the power is weak until the 10th generation.
| Generation | $H_0^{GW}$: p < 0.05 | $H_0^{GW}$: p < 0.01 | $H_0^{GW}$: p < 0.001 | $H_1^{GW}$: p < 0.05 | $H_1^{GW}$: p < 0.01 | $H_1^{GW}$: p < 0.001 |
|---|---|---|---|---|---|---|
| 7 | 6.4 | 1.9 | 0.3 | 27.8 | 11.8 | 3.6 |
| 8 | 5.6 | 1.4 | 0.3 | 44.2 | 22.2 | 7.6 |
| 9 | 5.5 | 1.1 | 0.3 | 58.6 | 38.5 | 17.0 |
| 10 | 5.7 | 1.5 | 0.2 | 79.4 | 60.8 | 35.9 |
| 11 | 4.8 | 1.0 | 0.1 | 93.1 | 82.0 | 64.2 |

Table 1: Proportions (in %) of p-values under the 0.05, 0.01 and 0.001 thresholds of the asymmetry test for the means of the GW process (1000 replicas), $p^{(0)} = (0.04, 0.08, 0.08, 0.8)$ (under $H_1^{GW}$, $p^{(1)} = (0.15, 0.08, 0.08, 0.69)$)



| Generation | $H_0^c$: p < 0.05 | $H_0^c$: p < 0.01 | $H_0^c$: p < 0.001 | $H_1^c$: p < 0.05 | $H_1^c$: p < 0.01 | $H_1^c$: p < 0.001 |
|---|---|---|---|---|---|---|
| 7 | 6.6 | 2.2 | 0.6 | 37.4 | 19.7 | 8.0 |
| 8 | 5.5 | 1.5 | 0.3 | 53.6 | 31.0 | 14.6 |
| 9 | 5.5 | 1.3 | 0.3 | 71.1 | 52.3 | 30.3 |
| 10 | 6.3 | 1.2 | 0.1 | 86.8 | 75.5 | 56.1 |
| 11 | 5.9 | 0.6 | 0.1 | 95.7 | 90.8 | 81.4 |

Table 2: Proportions (in %) of p-values under the 0.05, 0.01 and 0.001 thresholds of the asymmetry test for the parameters of the BAR process (1000 replicas), $a = b = 0.5$ (under $H_1^c$, $c = 0.5$; $d = 0.4$)



| Generation | $H_0^f$: p < 0.05 | $H_0^f$: p < 0.01 | $H_0^f$: p < 0.001 | $H_1^f$: p < 0.05 | $H_1^f$: p < 0.01 | $H_1^f$: p < 0.001 |
|---|---|---|---|---|---|---|
| 7 | 2.2 | 0.7 | | 23.1 | 7.4 | 1.4 |
| 8 | 3.3 | 0.5 | 0.1 | 41.3 | 20.5 | 6.1 |
| 9 | 3.8 | 0.5 | | 64.6 | 41.6 | 18.6 |
| 10 | 4.7 | 0.8 | | 82.9 | 68.1 | 46.3 |
| 11 | 5.5 | 0.7 | 0.1 | 94.5 | 88.5 | 74.5 |

Table 3: Proportions (in %) of p-values under the 0.05, 0.01 and 0.001 thresholds of the asymmetry test for the fixed points of the BAR process (1000 replicas), $a = b = 0.5$ (under $H_1^f$, $c = 0.5$; $d = 0.4$)
6. Application to real data: aging detection of Escherichia coli



To study aging in the single cell organism E. coli, Stewart et al. (2005) filmed 94 colonies of dividing cells, determining the complete lineage and the growth rate of each cell. E. coli is a rod-shaped bacterium that reproduces by dividing in the middle. Each cell inherits an old end or pole from its mother, and creates a new pole. Therefore, each cell has a type, old pole or new pole cell, inducing asymmetry in the cell division. Stewart et al. (2005) propose a statistical study of the genealogy and a pair-wise comparison of sister cells, assuming independence between the pairs of sister cells, which is not verified in the lineage.

Figures 1, 2a and 2b present the results of our tests of the null hypotheses $H_0^{GW}$, $H_0^c$ and $H_0^f$ on the 51 data sets, drawn from the 94 colonies, that contain at least eight or nine generations. Figure 1 shows that the hypothesis of equality of the expected number of observed offspring between two sisters is not rejected, whatever the data set. This result is not surprising: in our sets, the data are missing most frequently because the cells were out of the range of the camera. The null hypotheses of the two tests on the BAR parameters are rejected for one set in four for $H_0^c$ and for one in eight for $H_0^f$. A global conclusion on the difference between the old pole cell and the new pole cell is not easy. Regarding the results of the simulations in Tables 2 and 3, this lack of evidence is probably due to a low power of the tests at generations 8 and 9. Some data sets with more than 9 generations would probably show a more significant difference.

Figure 1: Histogram of the 51 p-values of the test of $H_0^{GW}$

Figure 2: Histogram of the 51 p-values of the tests of assumptions $H_0^c$ (a) and $H_0^f$ (b)